feat(core)!: more explicit handling of case-sensitivity in dictionaries #2630

86xsk · 2026-01-30T02:59:18Z

Issues

Fixes #2585
Related to #1688
Related to #2411

(Probably a few more, this has been a long-running issue.)

Description

Changes the dictionary API to allow more explicitly handling capitalization/orthographic word variations.
- Replaces WordId with CanonicalWordId and CaseFoldedWordId.
  - CanonicalWordId hashes the input word as is, without lowercasing or normalization.
  - CaseFoldedWordId lowercases and normalizes the word before hashing it (this is the current behavior of WordId).
  - Adds WordIdPair and EitherWordId to make it easier to work with CanonicalWordId/CaseFoldedWordId.
- Querying the dictionary case-insensitively now means you will get a Vec of words, not an individual word.
  - However, the current behavior, where a word is queried case-insensitively and returns the merged metadata across all of its casing/orthographic variants is preserved in the get_word_metadata_combined functions.
- Querying the dictionary case-sensitively will now return a specific entry in the curated dictionary, allowing differentiation for case/orthographic variants when desired.
Fixes the issue where some words would be linted by SpellCheck despite being present in the dictionary.
Makes SpellCheck no longer care about capitalization, fully transferring that responsibility to OrthographicConsistency (which already handles it anyway for the most part).
Re-adds "Pr" to the curated dictionary, since it no longer conflicts with "PR".

How Has This Been Tested?

cargo test
Manually building and testing the modified harper-ls on a few Markdown files

Checklist

I have performed a self-review of my own code
I have added tests to cover my changes

`SpellCheck` shouldn't handle capitalization if `OrthographicConsistency` is going to do it anyway.

- Splits `WordId` into `CanonicalWordId` and `CaseFoldedWordId`. - Updates dictionary functions to more explicitly handle casing. There are now functions to get a specific word case-sensitively, multiple words case-insensitively, and ditto but merge all metadata (which was the old behavior). - Fixes issue where `SpellCheck` would sometimes mark words as incorrect if an identical entry with different casing existed in the dictionary (e.g. OS, PR, etc.). - Makes `SpellCheck` no longer care about casing, since that is handled by `OrthographicConsistency`.

…mattic#2476)" This partially reverts commit 5230d6a. Returns the word to the dictionary, since removing it should no longer be necessary.

Since casing-related issues are now handled by `OrthographicConsistency`, not `SpellCheck`.

Expands the criteria in which `OrthographicConsistency` will lint for incorrect capitalization. Makes the `no_improper_suggestion_for_macos` test pass.

Allow all word casing/orthography that are defined in the dictionary. If the dictionary contains the exact word, `OrthographicConsistency` will skip it.

Both variants are defined in the dictionary, and appear to be valid in this case.

The test would expect 'Al' to be linted by `OrthographicConsistency` for not being all-caps.

Remove needless borrow.

hippietrail · 2026-01-30T09:47:33Z

Does it handle the same word having multiple WordIds? For instance you might have dictionary entries like:

- unpack/Vd # also creates "unpacked"
- packed/VtTU # also creates "unpacked
- unpacked/J # specifically creates "unpacked"

This happens a lot more than expected and causes some bugs of this type to crop up. I had a PR for it for many months that just tracked them but didn't really do anything with them as I wasn't sure what was and wasn't planned etc.

86xsk · 2026-01-30T18:29:09Z

I believe it will merge the metadata for words that have identical canonical word IDs, as seems to be the case here.

The code in expand_annotated_word adds or appends metadata based on the canonical word ID (i.e. the hash of the word exactly as stated in the dictionary or as expanded from another word, without lowercasing or normalization). "unpacked" will always have the same word ID regardless of whether it was specifically defined or came from an expansion, so all instances/manifestations of "unpacked" will have their metadata merged.

If in theory we had, say, a proper noun "Unpacked" (whether specifically defined or coming from an expansion), that would not be merged. It would remain a separate entry that can be queried with the appropriate dictionary functions.

However, tokens that originate from a Document still currently use the merged/combined metadata. So an "unpacked" token would have both the "unpacked" and "Unpacked" metadata merged.

// In Document::parse()
// Annotate DictWord metadata
for token in sent.iter_mut() {
    if let TokenKind::Word(meta) = &mut token.kind {
        let word_source = token.span.get_content(&self.source);
        let mut found_meta = dictionary
            .get_word_metadata_combined(word_source) // <---
            .map(|c| c.into_owned());

        if let Some(inner) = &mut found_meta {
            inner.pos_tag = token_tags[i].or_else(|| inner.infer_pos_tag());
            inner.np_member = Some(np_flags[i]);
        }

        *meta = found_meta;
        i += 1;
    } else if !token.kind.is_whitespace() {
        i += 1;
    }
}

hippietrail · 2026-01-31T05:28:23Z

I don't recall everything perfectly at the moment but both I think both cases are possible and only one is handled:

One WordID can refer to multiple words - due to case folding
One word can have multiple WordIDs

I believe you're talking about 1. which works but is behind a few bugs. But 2. is distinct. I found my old PR: #1035

Apologies if I've mixed up any concepts myself - it's been a while (-:

86xsk · 2026-01-31T16:50:25Z

Oh, I see what you mean now. Yeah no, at the moment I think this PR mostly sticks with the old behavior where once a DictWordMetadata has its one and onlyderived_from set, it's not updated again.

I did change the field to store a WordIdPair (storing both a canonical and case-folded word ID for its sole "parent" word), but I don't think that helps too much in this case. If I recall correctly, I did this so I could mimic the previous case-folded lookup in the DerivedFrom pattern, while still storing a canonical parent word ID too.

I'll take another look though. I wasn't too certain with that change as is, and the PR you mentioned does bring up an issue I haven't even considered.

86xsk added 19 commits January 20, 2026 01:12

test(core): add failing test

008c534

test(comments): don't expect 'lin' to be marked as a spelling error

a5f4a1b

test(core): move tests

5face23

`SpellCheck` shouldn't handle capitalization if `OrthographicConsistency` is going to do it anyway.

test(core): don't expect SpellCheck to mark capitalization issues

fe37ba6

deps(core): add indexmap

6f02acd

chore: update snapshots

ce666c5

Partially revert "fix(core): PR getting flagged as 'misspelled' (Auto…

4443a4e

…mattic#2476)" This partially reverts commit 5230d6a. Returns the word to the dictionary, since removing it should no longer be necessary.

test(core): merge tests and add test

e982f23

Merge branch 'master' into fix-dict-casing2

5688aee

test(core): move test

0a1e2d4

Since casing-related issues are now handled by `OrthographicConsistency`, not `SpellCheck`.

fix(core): fix logic in OrthographicConsistency

cf9a90a

Expands the criteria in which `OrthographicConsistency` will lint for incorrect capitalization. Makes the `no_improper_suggestion_for_macos` test pass.

test(core): add failing test

007df6e

fix(core): allow all case-variants in OrthographicConsistency

7518350

Allow all word casing/orthography that are defined in the dictionary. If the dictionary contains the exact word, `OrthographicConsistency` will skip it.

test(core): remove Lego -> LEGO test in OrthographicConsistency

3c9d54e

Both variants are defined in the dictionary, and appear to be valid in this case.

chore: update snapshots

8b426d9

test(core): add test

3381fba

test(core): fix incorrect test expectation

e11a2d6

The test would expect 'Al' to be linted by `OrthographicConsistency` for not being all-caps.

refactor(core): appease Clippy

b23f652

Remove needless borrow.

86xsk marked this pull request as draft January 31, 2026 17:30

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(core)!: more explicit handling of case-sensitivity in dictionaries #2630

feat(core)!: more explicit handling of case-sensitivity in dictionaries #2630

86xsk commented Jan 30, 2026

Uh oh!

hippietrail commented Jan 30, 2026 •

edited

Loading

Uh oh!

86xsk commented Jan 30, 2026 •

edited

Loading

Uh oh!

hippietrail commented Jan 31, 2026 •

edited

Loading

Uh oh!

86xsk commented Jan 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

feat(core)!: more explicit handling of case-sensitivity in dictionaries #2630

Are you sure you want to change the base?

feat(core)!: more explicit handling of case-sensitivity in dictionaries #2630

Conversation

86xsk commented Jan 30, 2026

Issues

Description

How Has This Been Tested?

Checklist

Uh oh!

hippietrail commented Jan 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

86xsk commented Jan 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hippietrail commented Jan 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

86xsk commented Jan 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

hippietrail commented Jan 30, 2026 •

edited

Loading

86xsk commented Jan 30, 2026 •

edited

Loading

hippietrail commented Jan 31, 2026 •

edited

Loading